performance and relative benefits of each model were then evaluated.
Writing samples were both expository and narrative. Data were from
statewide assessments of secondary school students' writing ability
for 1985 through 1988, for a total of 2,000 examinees. An examinee's
four samples were randomly given to a team of 80 to 100 trained
raters. Results indicate that both models were useful for the
calibration of writing samples. For this item set, the GR model
provided more information than did the PC model for both the rating
scales examined. In some cases, one might prefer the PC model because
of the fewer parameters to estimate and the minimal gains to be
expected by using the GR model in this context. It is possible, if
data collection is structured appropriately, to perform an interrater
agreement analysis through the use of item or test information
functions. The advantages of item response theory methods may be
realized with essay-type examinations. Eleven graphs are provided.

A Comparison of the Graded Response and Partial Credit Models for
Assessing Writing Ability¹

R.J. De Ayala, University of Maryland
B.G. Dodd & W.R. Koch, University of Texas

The assessment of writing proficiency may be accomplished either
through direct methods (i.e., the actual demonstration of the skill) or by
indirect methods (e.g., through the use of objective exams). Interest in the
direct measurement of writing proficiency has been growing, and half of all
statewide writing assessment programs now rely exclusively on writing
samples; the remaining programs use both direct and indirect assessments
(Stiggins & Bridgeford, 1983). This study was concerned with the direct
method of assessing writing proficiency.

The direct assessment of writing proficiency may be performed either
through the holistic approach or the analytic method. In the analytic scoring
method the ideal answer is decomposed into a set of components, each of which
is assigned a specific number of points. Assessment requires that the rater
evaluate the examinee's writing with respect to each component and assign
points according to the perceived quality of the writing on a given
component. The examinee's score is a linear composite of his or her component
scores.

In contrast, the holistic technique requires that the rater evaluate each
writing sample with respect to an ideal answer or standard; multiple standards
which are less than the ideal answer and which vary across the quality
continuum may be used as additional standards. The examinee's score is
typically a single score which is an assessment of the rater's global impression
of the quality of the written piece with respect to the standard(s).

¹ Paper presented at the Annual Meeting of the National Council on
Measurement in Education, San Francisco, March, 1989.

In order to improve the reliability of the writing assessment, multiple
raters may be used to evaluate the same writing sample and their ratings
pooled to form a single rating (e.g., the examinee's score is the average or the
sum of the raters' ratings). A more complete discussion of the issues involved
in the direct assessment of writing ability may be found in Breland (1983).

The above methods have historically been and currently are approached
primarily through classical test theory. A few researchers have recently
begun to approach writing assessment using item response theory (IRT)
models (e.g., Pollitt & Hutchinson, 1987; Ackerman, 1986). However, although
the results of these studies were encouraging, the comparative advantages and
disadvantages of the IRT models used could not be assessed due to
methodological differences. For instance, although both studies assessed writing
proficiency by the analytical method, Ackerman's method used five
components (e.g., paragraph development, spelling), whereas the Pollitt and
Hutchinson study used a different set of three components (e.g., appropriacy,
ideas).

An additional difference between the studies was Ackerman's use of one
expository question decomposed into 15 'items' (i.e., 3 raters X 5 components),
whereas the Pollitt and Hutchinson study used five writing tasks rated on
three components to produce 15 'items'; both studies used secondary school
students and their teachers as the raters. In each study the 'items' were
considered to be locally independent.

This study fitted Samejima's (1969) graded response (GR) model and Masters'
(1982) partial credit (PC) model to identical writing samples which were
holistically scored and then evaluated the performance and relative benefits
of each model. Further, the writing samples used consisted of two classes of
questions: expository and narrative. It was felt that the expository questions
would be found to provide more information for higher ability examinees than
the narrative items would because the expository items appeared to require
greater discourse competence than did the narrative items; discourse
competence is defined as the selection and structuring of ideas with respect to
the purpose of the writing task and the needs of the reader (Pollitt and
Hutchinson, 1987). A third factor investigated was the influence of two
different methods of pooling the raters' ratings on item parameter estimation.

Model Descriptions

The two polychotomous models, the GR and PC, are appropriate for items
with ordered responses, such as attitude questionnaires and aptitude or
achievement test items whose alternatives are inherently ordered or have
been ordered according to degree of correctness (e.g., through partial credit
scoring). In addition, ratings data (e.g., ratings of writing samples) may also
be fitted by either model.

The GR model is a direct extension of the two-parameter model. As a result,
the GR model contains a parameter which allows an assessment of an item's
capacity to discriminate among examinees. In the GR model the examinee
responses to item i are categorized into m_i + 1 categories, where higher
categories indicate more of θ and m_i is the number of categories. Associated
with each category of item i is a category score, x_i, with values 0..m_i. The GR
model may be expressed as:

\[
P^{*}_{x_i}(\theta) = \frac{e^{D a_i (\theta - b_{x_i})}}{1 + e^{D a_i (\theta - b_{x_i})}} \tag{1}
\]

where θ is the latent trait, a_i is the discrimination parameter for item i, b_{x_i} is
the difficulty parameter for category score x for item i, and the scaling
constant D equals 1.702. P*_{x_i}(θ) is the probability of the examinee responding
in category score x_i or higher for a given item; the probability of responding
in the lowest category (i.e., P*_0(θ)) or higher is defined as 1.0. For instance, for
an item with four response categories, P*_2(θ) is the probability of responding
in categories 2 or 3 rather than in categories 0 or 1. Because P*_{x_i} is the
probability of responding in x_i or higher, the probability of responding in a
particular category equals the difference between cumulative probabilities
for adjacent categories (e.g., P_2(θ) = P*_2(θ) - P*_3(θ)). When an item consists of
two categories (correct and incorrect), the GR model reduces to the
two-parameter model.
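
As a rough illustration of Equation 1, the following Python sketch computes the cumulative GR probabilities and then differences adjacent ones to obtain the probability of each category. The function names and the item parameter values (a discrimination of 1.2 and three category difficulties) are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

D = 1.702  # scaling constant given with Equation 1


def gr_cumulative(theta, a, b):
    """Probability of responding in a given category score or higher.

    Uses 1 / (1 + e^(-z)), which is algebraically the same as
    e^z / (1 + e^z) with z = D * a * (theta - b).
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def gr_category_probs(theta, a, bs):
    """Category probabilities for category scores 0..m under the GR model.

    bs holds the m category difficulties in increasing order.
    The lowest cumulative probability is defined as 1.0, and the
    probability of responding above the highest category is 0.0.
    """
    cum = [1.0] + [gr_cumulative(theta, a, b) for b in bs] + [0.0]
    # Difference between cumulative probabilities of adjacent categories.
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]


# Hypothetical four-category item evaluated at theta = 0.
probs = gr_category_probs(0.0, 1.2, [-1.0, 0.0, 1.0])
for score, p in enumerate(probs):
    print(score, round(p, 4))
```

With a single category difficulty the function returns the two-parameter model's probabilities of an incorrect and a correct response, which mirrors the reduction noted above.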

In contrast to the GR model, the PC model provides a direct expression of the
probability of an examinee with ability θ responding in a particular category.
In the PC model the examinee-item interaction is modeled as:

\[
P_{x_i}(\theta) = \frac{\exp\left[\sum_{j=0}^{x_i} (\theta - b_{ji})\right]}{\sum_{k=0}^{m_i} \exp\left[\sum_{j=0}^{k} (\theta - b_{ji})\right]} \tag{2}
\]

where θ is the latent trait and b_{x_i} is the difficulty parameter of the step
associated with category score x_i of item i with m_i categories, where
x_i = 1..m_i. A category score reflects the number of successfully completed
steps. A "step" is simply a stage required to complete an item. For instance,
the problem ((6/3)+2)^2 is considered to contain three steps because there are
three separate stages which must be completed (in a specific order) to
correctly answer the problem (i.e., step 1: 6/3, step 2: the addition of 2 to
the quotient, and step 3: the squaring of the quantity). For notational
convenience, Σ(θ - b_{ji}) where j = 0 is defined as being equal to zero.

Because the PC model is an extension of the Rasch model, it assumes that all
items are equally good at discriminating among examinees. In addition, as a
member of the Rasch family, the PC model's item and person parameters may
be estimated on the basis of the existence of sufficient statistics. Specifically,
an examinee's test score contains all the information for estimating his or her
ability, and the items' difficulties may be estimated from a simple count of the
number of persons completing each "step" of an item. Unlike the GR model,
the PC model requires that the steps within an item be completed in sequence,
although the steps need not be equally difficult nor be ordered in terms of
difficulty. If an item consists of only two categories, then the PC model
reduces to the Rasch model.
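
The PC model's category probabilities can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `pc_category_probs` and the step difficulties are hypothetical, and the zero-th term of the sum is set to zero as described above. The last lines check numerically that a one-step item gives the Rasch probability of a correct response.

```python
import math


def pc_category_probs(theta, steps):
    """Category probabilities for category scores 0..m under the PC model.

    steps holds the m step difficulties in step order; they need not be
    increasing. The exponent for category score 0 is defined as zero.
    """
    exponents = [0.0]
    for b in steps:
        # Running sum of (theta - b_j) over the steps completed so far.
        exponents.append(exponents[-1] + (theta - b))
    top = max(exponents)  # subtracting the maximum avoids overflow
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-step item (four categories) evaluated at theta = 0.5.
probs = pc_category_probs(0.5, [-0.5, 0.3, 1.1])
print([round(p, 4) for p in probs])

# A one-step item should match the Rasch model.
one_step = pc_category_probs(0.5, [0.2])
rasch = 1.0 / (1.0 + math.exp(-(0.5 - 0.2)))
print(abs(one_step[1] - rasch) < 1e-12)
```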

Method

Data: The data came from a state-wide assessment of secondary school students'
writing ability. This test has been given annually since its inception in 1984
and is required for graduation from high school. Except for the 1984
administration, the writing test consisted of four writing samples, two of which
were expository in nature and the two remaining items were narrative; the 1984
administration contained one expository and one narrative item. Two of the
four items used in the 1985 to 1988 administrations were from the 1984 testing.
These two items appeared in all administrations and served as a 'link' across
administrations so that the separate testings could be placed on the same scale.

An examinee's four writing samples were randomly given to a team of 80 to
100 specially trained raters. The ratio of writing samples to individual raters
made it very unlikely that the same rater would rate more than one writing
sample by a given examinee.

Each writing sample was holistically scored by two raters on a 1 to 4 scale;
items with 0 scores indicated that the writing sample could not be scored (e.g.,
the student did not provide an answer) and were eliminated from analysis. If
the two raters' ratings disagreed by more than two points, a third rater was
used. Exact interrater agreement occurred on approximately 76% of the
ratings, and periodic "check packets" (i.e., pre-scored writing samples) were
distributed to monitor any drift in the ratings.

Rating Type: Because each writing sample had at least two ratings, the impact
of using a simple sum rating versus an average rating was investigated. The
former method consisted of the sum of the raters' ratings (in the case of three
raters the two closest ratings were used), which was then transformed to a 0 to
6 range (a.k.a. the sum rating scale). That is, the sum of the two ratings
(range 2 to 8) was transformed by simply recoding a rating of 2 to 0, a rating
of 3 to 1, etc. The second approach was to round the average of the two ratings
to the nearest integer; in this latter method the original readers' 1 to 4 ratings
were recoded to 0 to 3 (i.e., a rating of 1 was recoded to 0, a rating of 2 was
recoded to 1, etc.) before calculating the average and rounding to the nearest
integer. This method will be known as the average rating scale.
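
The two pooling rules can be sketched as follows. The function names are hypothetical, and the sketch takes as input the two ratings already selected for a writing sample (it does not model the third-rater rule). The text does not say how an average ending in .5 was rounded, so the sketch assumes that ties round up; a different tie rule would change the average rating for pairs that differ by one point.

```python
import math


def _check(rating):
    """Reject anything that is not a scorable 1 to 4 holistic rating."""
    if rating not in (1, 2, 3, 4):
        raise ValueError("ratings must be integers from 1 to 4")


def sum_rating(r1, r2):
    """Sum of two 1-4 ratings (range 2 to 8) recoded to the 0 to 6 range."""
    _check(r1)
    _check(r2)
    return (r1 + r2) - 2


def average_rating(r1, r2):
    """Recode each rating to 0-3, average, and round to the nearest integer.

    Assumption: an average ending in .5 is rounded up.
    """
    _check(r1)
    _check(r2)
    mean = ((r1 - 1) + (r2 - 1)) / 2
    return math.floor(mean + 0.5)


# Hypothetical rating pairs: (rater 1, rater 2).
for pair in [(1, 1), (2, 3), (3, 4), (4, 4)]:
    print(pair, sum_rating(*pair), average_rating(*pair))
```

The sum scale keeps seven distinct values while the average scale keeps four, which is the difference in scale range examined in the analyses that follow.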

Calibration: The MULTILOG 5.1 (Thissen, 1988) calibration program was used to
fit both the GR and the PC models to the ten-item pool. The use of a single
calibration program for both models controlled for differences in the
implementation of estimation algorithms when different calibration programs
are used. Although MULTILOG provides for direct specification of the GR
model, obtaining item parameter estimates for the PC model is not direct.
Estimates for the PC model were obtained by imposing triangular contrasts on
Bock's (1972) nominal response (NR) model (cf. Thissen & Steinberg, 1986).
Imposing these triangular contrasts on the NR model is the logical equivalent
of making the a priori order assumption necessary for the PC model (Thissen,
1988; Masters & Wilson, 1988).

A total of 9652 examinees with usable response strings were obtained from
the four administrations: 1985: N = 2264, 1986: N = 3026, 1987: N = 3002, and
1988: N = 1360. Because of practical and financial considerations the entire
data set could not be used, and a random sample of 500 examinees with no
non-responses was obtained from each administration. Therefore, item parameters
for the 10 items were obtained on the responses of 2000 examinees. Because
MULTILOG implements a marginal maximum likelihood estimation algorithm,
the item and person parameters are estimated separately. Therefore, the small
number of items relative to the calibration sample's size was not problematic.

The five annual examinations were placed on the same scale by linking the
separate exams through the common (1984) items and performing a
simultaneous calibration of the four administrations. The crossing of IRT
model (PC vs. GR) by rating type (i.e., sum vs. average) produced a 2 X 2 design
with one calibration per cell.

Analysis: Analysis consisted of an examination of operating characteristic
curves (OCC) and item and test information functions. For each rating type the
relative efficiency of each of the 10 items when calibrated using the PC model
was compared to those of the items when calibrated using the GR model; the
same approach was used for comparing the annual administrations. For each
model the relative efficiencies using the sum rating were contrasted with the
average rating. Further, the relative efficiencies of the expository items with
respect to the narrative items were examined for each model.
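
The quantities compared in this analysis can be sketched as follows. The sketch assumes the standard information formulas for the two models (for the PC model, the variance of the category score at θ; for the GR model, the sum over categories of the squared slope of the category curve divided by its probability) and takes relative efficiency to be the ratio of two information functions at the same θ. All function names and parameter values are hypothetical; none are estimates from this study.

```python
import math

D = 1.702  # scaling constant used by the GR model


def pc_probs(theta, steps):
    """PC category probabilities for category scores 0..m."""
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + (theta - b))
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def pc_information(theta, steps):
    """PC item information: variance of the category score at theta."""
    probs = pc_probs(theta, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum((k - mean) ** 2 * p for k, p in enumerate(probs))


def gr_information(theta, a, bs):
    """GR item information: sum over categories of slope^2 / probability."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-D * a * (theta - b))) for b in bs] + [0.0]
    # Slope of each cumulative curve; the two boundary curves are flat.
    slope = [D * a * c * (1.0 - c) for c in cum]
    info = 0.0
    for k in range(len(bs) + 1):
        p = cum[k] - cum[k + 1]
        if p > 0.0:
            info += (slope[k] - slope[k + 1]) ** 2 / p
    return info


def relative_efficiency(info_a, info_b):
    """Ratio of two information values computed at the same theta."""
    return info_a / info_b


# Hypothetical item calibrated under each model, compared across theta.
for theta in (-2.0, 0.0, 2.0):
    pc = pc_information(theta, [-1.0, 0.0, 1.0])
    gr = gr_information(theta, 2.0, [-1.2, 0.0, 1.2])
    print(theta, round(relative_efficiency(gr, pc), 3))
```

A test information function under either model is the sum of the item information functions at each θ, so the same ratio can be formed for whole administrations.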

Results and Conclusion

Neither the PC nor the GR model could satisfactorily fit the 1988
administration's narrative item. Therefore, for the following presentation this
item as well as the expository item for this testing were eliminated.

Despite the fact that the exams were not developed utilizing a target
information function, the inspection of test informations across administrations
revealed that, although the functions were not identical, they were relatively
similar for the PC model regardless of rating scale used (Figures 1 and 2). Some
administrations provided more information in particular θ ranges than
others, but, in general, the exams appeared to be measuring ability with
relatively the same degree of accuracy. In contrast, for the GR model one can
see from Figures 3 and 4 that while the information functions for the
1984-1986 administrations were very similar, the 1987 testing yielded greater
information than the other administrations, regardless of rating scale used.
In addition, unlike the other administrations, the 1986 testing yielded greater
information in the approximate θ range of -1.0 to 1.5 based on the average
rating scale than it did using the sum rating scale.

Insert Figures 1 to 4 about here

The relative efficiencies of the two rating methods for each model are
presented in Figure 5. As can be seen from this figure, for both models the
sum rating type provided greater information than the average rating scale.
The additional categories in the sum rating scale (relative to the average
rating scale) provide greater information than the more restricted average
rating scale. In general, the θ range encompassed by the difficulty
parameters based on the sum rating scale was larger than for the average
rating scale. For example, for the GR model the average rating scale's
difficulty parameter, b_[illegible], had estimates which (roughly) corresponded in
magnitude to those of b_[illegible] using the sum rating scale (6 difficulty
parameters). Similarly, for the PC model based on the average rating scale the
step difficulty estimates for a score of 3 were between the sum rating scale's
step difficulties of b_[illegible] and b_[illegible].

Insert Figure 5 about here 

As can also be seen from Figure 5, the difference between the PC model's
information functions based on the sum and average rating scales was
substantial. It was expected that the sum rating scale would provide more
information than the average rating scale given that, for the PC model,
four-step items have been found to yield more total information across the θ
continuum than three-step items (Dodd & Koch, 1987). However, for the GR
model the difference between the test information functions based on the two
scales was not as dramatic. This lack of a substantial difference between the
information functions for the rating scales may be due to the GR model's use of
a discrimination parameter. That is, to a certain extent large a's for items
scored using the average scale may compensate for the sum rating scale's
additional categories. For the average rating scale the mean a was 2.285,
whereas for the sum rating scale the average a was 2.161. Although two-thirds
of the average rating scale items had a's which were larger than the
corresponding sum rating scale items' a's, this increase in the average rating
scale's mean discrimination was primarily the result of three items where the
differences were 0.30, 0.37, and 0.51. In those cases where the average rating
scale items' a's were less than those of the sum rating scale, the differences
were 0.19, 0.04, and 0.04.

In addition, Figure 5 shows that the GR model using the sum ratings
provided more information than did the PC model, regardless of rating scale used
with the PC model. Differences between models and rating scales for θ ≥ 3.0 or

Previous (classical test theory) research with scoring scales has indicated
that larger scale ranges produce higher reliabilities than smaller scale ranges
(e.g., Coffman, 1971; Godshalk, Swineford, & Coffman, 1966). Results from this
study indicated that for either model the larger rating scale (sum holistic
rating) yielded greater information than did the smaller rating scale (average
holistic rating).

For the PC model using the sum rating scale there were three reversals (out
of sequence b's) for all items. Specifically, an ordering of the step
difficulties found the following relationships: b2 < b1 < b4 < b3 < b6 < b5. One
possible interpretation of this finding is that given a rating of, e.g., a 3 (e.g.,
based on the b2/b1 reversal), it was "easier" or more likely for the examinee to
be given a rating of, e.g., a 4; the same logic may be applied to the other
reversals. Consistent with this interpretation, the PC model's OCC for the 1985
administration's narrative item (Figure 10) showed that certain rating scores
were not as probable as others. As can be seen from this figure, the
probability of obtaining a rating score of 3 (category score of 1) was far less
than that of raw scores of 2 or 4 (category scores of 0 or 2, respectively).
Similarly, rating scores of 5 or 7 were not as likely to be given as scores of 4
and 6 or 6 and 8, respectively.
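
The effect of such reversals on the category curves can be illustrated with a small Python sketch. The step difficulties below are hypothetical values chosen only to reproduce the reported ordering b2 < b1 < b4 < b3 < b6 < b5; they are not the study's estimates, and the function names are likewise assumptions. The sketch records which category score is most probable at each point of a θ grid.

```python
import math


def pc_category_probs(theta, steps):
    """PC category probabilities for category scores 0..m."""
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + (theta - b))
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(steps, lo=-4.0, hi=4.0, points=81):
    """Category scores that are the most probable somewhere on a theta grid."""
    modal = set()
    for i in range(points):
        theta = lo + (hi - lo) * i / (points - 1)
        probs = pc_category_probs(theta, steps)
        # Index of the largest probability (first index on exact ties).
        modal.add(max(range(len(probs)), key=probs.__getitem__))
    return sorted(modal)


# Hypothetical steps b1..b6 with the ordering b2 < b1 < b4 < b3 < b6 < b5.
reversed_steps = [-1.5, -2.0, 0.0, -0.5, 1.5, 1.0]
# The same values placed in increasing order, for comparison.
ordered_steps = sorted(reversed_steps)

print(modal_categories(reversed_steps))
print(modal_categories(ordered_steps))
```

With these hypothetical values, category scores 1, 3, and 5 (rating scores of 3, 5, and 7) are never the most probable response under the reversed ordering, whereas every category score is most probable somewhere when the same step values are ordered.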

Insert Figure 10 about here

With respect to the GR model and as would be expected, the difficulty
parameters for the sum rating scale fitted by the GR model did not exhibit any
reversals. That is, the GR model's difficulty parameters correspond to the
points of inflection for the category characteristic curves, a set of ogival
shaped curves specifying the cumulative probabilities of responding in one set
of one or more categories versus another set of one or more categories. As
stated above, in order to obtain the probability of responding in a particular
category the cumulative probabilities for adjacent categories must be
subtracted. This fact implies that the difficulty parameters associated with an
item's options must be ordered.

Although the GR model does not allow reversals to occur in its difficulty
parameters, the model's OCCs for the 1985 administration's narrative item
showed a striking similarity to the PC model's for the same item (Figure 11); a
comparison of Figures 10 and 11 showed subtle differences between
corresponding category scores (e.g., 0, 1, 5, and 6). In fact, if the GR model's
difficulty parameters were defined as those of the PC model, then reversals would
have occurred for the GR model as well.

Insert Figure 11 about here

Because the interpretation of the GR model's OCCs parallels that of the PC
model's, it appears that, in effect, for both models the 7-point rating scale
became functionally a 4-point scale.

This final analysis demonstrated how either model may be used for a rating
scale analysis. Unlike classical techniques, this method is based on parameter
estimates which are not sample dependent. In addition, although this study's
data collection method precluded an analysis of interrater agreement, it is
possible, if data collection is structured appropriately, to perform an
interrater agreement analysis through the use of item or test information
functions; if desired, items may be grouped to form "concept" information
functions and the interrater analysis may be performed on a concept basis. In
this regard, one could not only determine the degree of interrater agreement, but
also where there was a lack of agreement. It is felt that the advantages of IRT
methods may be realized with essay-type examinations.






θ ≤ -3.0 were not considered meaningful given the minimal amount of
information provided by each model in these ability regions. In general, the
interaction of relatively large discrimination parameters (with respect to the
assumed constant a value in the PC model) and a wider θ range encompassed by
the GR difficulty parameters than by the PC model's resulted in the GR model
providing greater information than did the PC model. The GR model's test
information function based on the sum rating provided more information
than the PC model's test information function with the same rating scale.

Figure 6 shows that for the majority of the θ continuum the expository
items provided greater information than the narrative items when they were
fitted by the PC model. An item analysis by administration showed that, except
for the 1985 administration (-0.75 ≤ θ ≤ 2.0), as θ increased the expository items
provided more information than did the narrative items. These relationships
are presented in Figure 7. As can be seen, for higher ability examinees
expository items provided increases in information of as much as [illegible] times
that of narrative items.

Insert Figures 6 and 7 about here

In contrast to the PC model's results, expository items fitted by the GR model
only provided information greater than the narrative items for θs greater
than approximately 1.0 (Figure [illegible]). An inspection of the relative
efficiencies for the narrative versus expository items per administration
(Figure 8) revealed that, except for the 1985 testing (as was the case for the PC
model), the pattern of expository items' information increasing with increasing
θ is also evident for the GR model. However, under the GR model the narrative
items appear to provide more information than the expository items for a larger
portion of the θ continuum than they did under the PC model. For both models,
results based on the average rating scale were similar to those of the sum
rating scale.

Insert Figure 8 about here






Discussion

The results of this study appear to indicate that both models are useful for
the calibration of writing samples. However, for this item set the GR model
provided more information than the PC model for both rating scales. The
greater information provided by the GR model was primarily a result of a
discrimination parameter which was allowed to vary across items (1.789 to
2.823) and which in all cases was larger than the assumed constant a value of
the PC model. In those cases where a is relatively constant across items (e.g.,
fitting a GR model and examining the item discriminations), one may prefer to
use the PC model because of the fewer parameters to estimate and the minimal
gains to be expected by using the GR model in this context. In addition, the
decision to use one model over the other may be a result of pragmatic
constraints as well as philosophical beliefs concerning the estimation of
parameters other than b (see Wright and Stone (1979) for more information on
the estimation controversy).

Results indicated that, in general, expository items provided more
information for higher θs than narrative items did. This finding is consistent
with the nature of these two discourse models. Therefore, for the assessment
of high abilities, items of an expository nature would be preferred to narrative
items. Of course, after a sufficiently large item pool is developed and
calibrated, item information functions may be utilized to construct exams
according to a target test information function and/or which are essentially
weakly parallel (Samejima, 1977). For instance, Figure 9 contains the
information functions for the most informative test actually administered (i.e.,
the 1987 administration) and a hypothetical four-item test (constructed from
the 8-item pool used in this study). As can be seen, the developed test (items:
1985/1987 expository and 1985/1986 narrative items) provided almost twice the
information of the 1987 administration, in addition to providing more of this
information over a wider θ range.

Insert Figure 9 about here

Test Informations for 1984-1988 administrations for PC model

Figure 4

Test Informations for 1984-1987 administrations for GR Model

Figure 7

Relative Efficiencies for Expository vs. Narrative Items
PC Model Sum Rating Scale

Figure 10